5. Decision Tree Classifier
Aim
To write a Python program that uses the scikit-learn module to train a decision tree classifier on the Iris flower dataset.
Understand the Decision Tree Classifier Before You Begin
Overview: A decision tree is a supervised machine learning model that can be used for both classification and regression; this exercise uses the classification variant. It works by recursively splitting the dataset into subsets based on feature values, creating a tree-like model in which each internal node represents a test on a feature, each branch represents an outcome of that test, and each leaf node represents a class label (or, for regression, a predicted value).
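The algorithm chooses each split using an impurity criterion such as Gini impurity or entropy, picking the split that makes the child nodes as pure as possible. Information gain is the reduction in entropy that a split achieves, so maximizing information gain and minimizing the entropy of the child nodes are the same goal. Decision trees are widely used for medical diagnosis, credit risk assessment, customer churn prediction, and feature selection because the learned rules are easy to interpret.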
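To make the split criteria concrete, the following is a minimal sketch of Gini impurity, entropy, and information gain for a list of class labels. The function names and the toy labels are illustrative assumptions, not part of scikit-learn; the library computes these quantities internally.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)) over the class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Toy example (hypothetical labels): a perfectly mixed node split into two pure children.
parent = ["setosa"] * 5 + ["versicolor"] * 5
left, right = parent[:5], parent[5:]

print(gini_impurity(parent))                  # 0.5
print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # 1.0
```

A pure node has impurity 0 under both measures, which is why a split that separates the classes completely yields the maximum possible gain.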
Further Understanding: Decision Trees
Algorithm
- Load the Dataset: Load the Iris dataset using load_iris.
- Select Features and Labels: Extract the flower measurements as the feature set X and the species labels as the target y.
- Split the Dataset: Split the dataset into training and testing sets using train_test_split, ensuring class distribution is maintained with stratify=y.
- Create the Model: Create a DecisionTreeClassifier, choosing a split criterion such as gini or entropy.
- Train the Model: Fit the classifier to the training data.
- Predict and Evaluate: Predict labels for the test set and compute the accuracy and the confusion matrix.
- Visualize the Tree: Display the fitted tree structure to show the learned decision rules.
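For the task stated in the Aim, a decision tree classifier on the Iris dataset, the workflow might be sketched as follows. The choices of test_size=0.25, random_state=42, and the gini criterion are assumptions made for illustration, as is the use of all four features; the lab does not specify them.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the dataset and separate features from labels.
iris = load_iris()
X, y = iris.data, iris.target

# Hold out a test set; stratify=y keeps the class proportions equal in both parts.
# test_size and random_state are illustrative assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Create and train the decision tree (criterion could also be "entropy").
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"Test accuracy: {accuracy:.3f}")
print(cm)

# Text rendering of the learned decision rules.
print(export_text(clf, feature_names=list(iris.feature_names)))
```

For a graphical view of the tree, sklearn.tree.plot_tree can be used together with matplotlib in place of export_text.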
About Iris Dataset
The dataset contains petal and sepal measurements for 3 species of iris (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray. The rows are the samples and the columns are Sepal Length, Sepal Width, Petal Length, and Petal Width.
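A quick way to confirm this layout is to inspect the object returned by load_iris. This is a small sketch; the variable name class_counts is chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix: one row per flower, one column per measurement.
print(type(iris.data).__name__, iris.data.shape)  # ndarray (150, 4)

# Column names and class names as stored by scikit-learn.
print(iris.feature_names)
print(list(iris.target_names))

# Number of samples per class (labels are the integers 0, 1, 2).
class_counts = np.bincount(iris.target)
print(class_counts)  # [50 50 50]
```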
Dataset Information
| Property | Value |
|---|---|
| Number of Instances | 150 (50 in each of three classes) |
| Number of Attributes | 4 numeric, predictive attributes and the class |
| Attribute Information | Sepal length, sepal width, petal length, petal width (all in cm) |
| Classes | 3 (Iris-Setosa, Iris-Versicolour, Iris-Virginica) |
Source: Dataset Link
Visualization
Interactive Visualization of the Decision Tree Classifier.
Pre-Lab Questions
- Why are decision trees preferred for use with an ensemble approach?
- What is information gain? How is it related to decision trees?
- What is the Gini index? How is it related to decision trees?
Post-Lab Questions
- Apply standard scaling to the dataset and give your observations on the performance.
- Run the code on the wine dataset from the scikit-learn module and present the results.
Result
The decision tree classifier was successfully implemented on the Iris dataset. The model achieved high accuracy, and the resulting tree structure and confusion matrix clearly demonstrated effective classification across all three iris flower species.